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Abstract 

We propose a method to generate a warning system for the early 
detection of time clusters applied to public health surveillance data. 
This new method relies on the evaluation of a return period associated 
to any new count of a particular infection reported to a surveillance 
system. The method is applied to Salmonella surveillance in France 
and compared to the model developed by Farrington et al. 



1 Introduction 

Since the pioneering work of Serfling (see [IH]), several statistical models 
have been proposed to detect time clusters from specific surveillance data. 
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A time cluster is denned as a time interval in which the number of observed 
events is significantly higher than the expected number of events in a given 
geographic area. The term "event" is generic enough to include any event of 
interest such of illness, an admission to an emergency department, 

a death or any other health event. 

The published models can be classified into three broad approaches: regres- 
sion methods, time series methods and statistical process control as proposed 
by some recent reviews (see [20] , [7] , [H] ) . In most cases they are based on 
two steps: (i) the calculation of an expected value of the event of interest for 
the current time unit (generally a week or a day); (ii) a statistical compari- 
son between this expected value and the observed value. A statistical alarm 
is triggered if the observed value is significantly different from the expected 
value. 

The first step is based on the past counts, or more often on a sample of the 
past counts, that takes the seasonality pattern(s) into account. Thus, the 
current count is compared to counts that occurred in the past during the 
same time periods, e.g. the same week more or less three weeks for the last 
five years. Alternatively, sinusoidal seasonal components can be incorporated 
into regression models to deal with the seasonality and to easily take secular 
trends into account. More rarely, models try to reduce the influence of weeks 
coinciding with past outbreaks. One solution to avoid that such outbreaks 
reduce the sensitivity of the model is to associate low weights to these weeks 
(see e.g. [5]). 

The early prospective detection of time clusters represents a statistical chal- 
lenge as the models must take the main feature of the data into account 
such as secular trends, seasonality, past outbreaks but are also faced with id- 
iosyncrasies in reporting, such as delays, incomplete or inaccurate reporting 
or other artefacts of the surveillance systems. Reporting delays are partic- 
ularly problematic for surveillance systems that are not based on electronic 
reporting. Concerning non-specific surveillance systems, the same difficulties 
are encountered, excepted for the reporting delays because these surveillance 
systems are mostly based on electronic reporting. 

The intentional release of anthrax in the USA in October 2001 emphasized 
the need to develop new early warning surveillance systems (see |9].|18j). 
These surveillance systems treat an increasing number of data provided from 
multiple sources of information (see One logical consequence was to 

perform statistical analyses with a daily frequency. 

Developing automated statistical prospective methods for the early detection 
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of time clusters is thus essential. It is important for a public health surveil- 
lance agency to run several statistical methods concomitantly in order to 
compare the alarms generated by these methods. It is crucial to carry on the 
development of new methods because the combination of methods increases 
the sensitivity and the positive predictive value of the surveillance system. 
It is the reason why we propose in this paper a new approach based on 
Extreme Value Theory (EVT) (see e.g. [5]) for the early detection of time 
clusters. To illustrate the performance of the method, we applied it to the 
detection of time clusters from weekly counts of Salmonella isolates reported 
to the national surveillance system in France. 

Salmonellosis is a major cause of bacterial enteric illness in both humans 
and animals, with bacteria called Salmonella. In France, Salmonella is the 
first cause of laboratory confirmed bacterial gastroenteritis, of hospitalization 
and of death. In 2005, a study estimated that between 92 and 535 deaths 
attributable to non typhoidal Salmonella occurred each year (see [22]). 
The paper is organized as follows. The surveillance system and the data are 
presented in Section 2. A description of our method to check if each new 
observation corresponds to an unusual/extremal event is given in Section 
3. Applications to counts of Salmonella as well as a comparison to the 
Farrington method (see [6]) are developed in Section 4. A discussion follows 
in the last section. 

2 Data 

The National Reference Center for Salmonella contributes to the surveillance 
of salmonellosis by performing serotyping of about 10000 clinical isolates each 
year. Salmonella surveillance is based on a network of 1500 medical labo- 
ratories that voluntarily send their isolates. Salmonella enterica serotypes 
Thyphimurium and Enteritidis represent 70 % of all Salmonella isolates in 
humans among many hundreds of serotypes; that is why we consider in this 
paper mainly these two serotypes. For illustrative purpose, four other less fre- 
quent serotypes (Manhattan, Derby, Agona and Virchow) might also be con- 
sidered; Figure [T] shows the weekly number of isolates for these six serotypes 
from January 1, 1995 to December 31, 2008. It highlights the great vari- 
ability in terms of seasonality, secular trend and weekly number of reported 
isolates and frequencies of unusual events. 
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Let {Y t ; t e KT} be the time series corresponding to the number of isolates at 
time point t for a given serotype. As mentioned by many authors (see e.g. 
[15], [9]), seasonal effects may have a strong impact to generate a statistical 
alarm. A usual way to prepare dataset is to select counts from comparable 
periods in past years, as described in the literature (see [21], [6]). The dataset 
is restricted to the counts that occurred during the times within these com- 
parable periods. For instance, if the current time is t of year y, then only the 
counts for the n = b(2w + 1) times from t — w to t + w of years from y — b 
to y — 1 (b > 1, w > 1) are used. 

From now on, (X t ) will denote the resulting times series that will be con- 
sidered in our study. As an illustration Figure 2 represents the restricted 
dataset for Salmonella Typhimurium, for a given current week. 



Suppose we have at our disposal n successive observations that we consider as 
realizations of a sample (Xj) of independent and identically distributed (i.i.d.) 
non-negative random variables defined on a probability space (Q, A, P), from 
a distribution function F. 

Recall that a return level zt associated with a given return period T corre- 
sponds to the level expected to be exceeded on average once every T time 
units, i.e. such that 



where H{A} represents the indicator function that is equal to 1 if A is true and 
to otherwise. The last equality can be rewritten as 1 — F(zt) = 1/T. Hence, 
the return level zt corresponds simply to a pt~ quantile with pt = 1 — 1/T, 
zt = F*~(l — 1/T), F^ denoting the generalized inverse function of F. 
The idea of the method is to associate with each observation x s a return 
period T s defined theoretically as (1 — F{x s ))~ 1 to be able to determine the 
return period T t associated to each new observation xt at time t, then to look 
backwards (and not forwards as in the standard way) in the interval (t — T t ; t) 
for the existence of an observation that would exceed x t ; if it exists, we gen- 
erate a warning time at time t since on average we were not expecting a 
second exceedance on (t — T t ; t]. 

Notice that in our discrete framework it will not be possible to estimate ex- 
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plicitly the return levels; instead estimated bounds proposed in [TT] will be 
considered. 

Therefore, after a preliminary analysis of the data and definition of our sam- 
ple, we will compute the estimated bounds of the return levels in order to 
obtain a graph of the return periods and levels. Then, we will allocate a 
return period to any new observation x t to test if t corresponds to a warning 
time according to our definition. 



3.1 Bounds for the return level 

Looking at extremal events leads us to the crucial problem of high quantile 
estimation. Such a purpose has been extensively studied in the literature (see 
e.g. [H]), and the classical approach, in the i.i.d. setting, consists to use the 
Extreme Value Theory assuming that exceedances above a high threshold 
approximately follow a Generalized Pareto distribution (this result is known 
as the Peak-Over-Threshold (POT) method). However, this theory is only 
valid in the case where the underlying distribution function F is continuous. 
This is not the case in the epidemiology context. Therefore, we propose to 
use instead upper and lower bounds for the return level z t and estimate them, 
following the method developed by Guillou et al. (see [H]); this method has 
several advantages: the upper and lower bounds can be computed for any 
value of t (in particular it holds for large values), it does work for both small 
and large samples, and for F continuous or discrete. So this approach is 
well- adapted to our context, when assuming the random variables associated 
to the observations i.i.d. 

Let us recall the expression of those upper and lower bounds, given respec- 
tively by 

inf|&t(u, v) : u > 0, v > non-decreasing functions! (1) 
where b t (u,v) := ( ^'^M ; wit h 9(u,v) := E[u(X)v(F(X))}, 



and 



v(l - l/t) 



sup|ft(M, w, q) : u > non- decreasing function,!/; > non-increasing function, q > 1 

(2) 
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where 



/ a* 



£ t (u,w,q) := 



V 



w(l/t)(l-l/t) 



l/q' 



9*(u, w) := E [u{X)w{l - F(X))\ . 
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Estimators of those two bounds (CQ) and (J2J) follow when considering natural 
estimators of 9(u,v) and 9*(u,w), namely 



b t (u,v) = u* 



t9 n (u,v) 
v(l - l/t) 



and £t(u, w, q) = w 4 



9* n (u, w) - t~ l+l l q [6*(u q , w q ) 



1/9 



w(l/t)(l-l/t) 



(3) 

where, if (Xi >n < ... < X n n ) denote the order statistics from a given sample, 
0„(u,u) = - y^u(X in )v (-) and ^(tt, w) = - Y] u(X in )w( 1 



i=l 



i=l 



Under some conditions on u, v and to, asymptotic distributions are obtained 
for the bounds estimators, as well as asymptotic confidence intervals when 
using the delta method (see Section 3 in [TTj). 

For instance, concerning the upper bound, we have the following asymptotic 
confidence interval: 



b t (u,v) E 



b t (u,v) ± - ,', —(u^ 



nv(l - \y 1 \v(l - i) 



(4) 



where qi- a /2 denotes the quantile of order 1 — a/2 of the standard normal 
distribution and a the empirical version of a defined for U uniformly dis- 
tributed on (0, 1) as 



a 2 =E\ -v(U)u(F < -(U)) + 9(u,v) 



J (t {u < t} -ty{t)u{F^{t))dt 



A similar confidence interval can be obtained for the lower bound. 



Finally, since it is impossible to optimize in ([T]) and ([2]) under the whole family 
of non-negative and non- decreasing functions u and v and non-negative and 
non- increasing functions w, we reduce the problem by choosing the sub-class 
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of power functions since it seems adapted to our study, giving reasonable 
results (even if not optimal). 

We consider the functions defined by u(x) = x a , v(x) = x 13 , w(x) = x~ u , 
with a, /3, v positive real numbers and set q = 2 (changing the value of this 
last parameter does not affect significantly the final result). Then we solve 
numerically, for e close enough to 0, the following optimization problem 

(a t , Pt) = argmin \^b t {x a ,x 13 ) : a G [e,a ],/3 G [e,/3 ]| and 
(a t *,z//) = argmax |? t (x Q , x~ u , 2) : a G [s,a ],u G [e,^o]|, (5) 

in order to obtain the estimated upper and lower bounds equal, respectively, 
to 

% = b t {x*\x Pt ) and f t =f ( (/,i" i; ' , ,2). (6) 

As already said, the choice for u, v, w and q does not necessarily correspond to 
the optimal bounds but covers a wide enough range of bounds that provides 
satisfying results when working on various epidemiology datasets, as we are 
going to see in Section 4. 



3.2 Determination of an alarm time 

Let us present our method to define an alarm system. 
It will consist in three main steps. 

Note that using bounds for a return level zt will imply that the return period 
defined theoretically by (1 — F(^))~ 1 cannot be explicitly estimated and we 
have 

Te < (1 — F(z t ))~ l < T b (7) 

where Tg and T b denote the return periods of the bounds £ t (u,w,q) and 
bt(u,v) respectively. 

Step 1: We draw the plot of the return period on the x-axis and the cor- 
responding estimate of the upper bound of the return level (instead of the 

return level itself): ft, b^j ■ 

Step 2: We allocate to each observation x ti , i > 1, a return time Tj us- 
ing the previous plot. Namely, x ti corresponds to a value b T . of the y-axis 
of the plot from which we deduce the associated return level Tj. Reading an 
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observation as an upper bound of a return level means that Tj is in fact the 
lower bound of the theoretical return period (1 — T(x t .)) _1 that should be 
associated to the observation x ti , because of ([7]). 

We justify our choice as follows. Considering £ T . instead of in the above 
method would have led to overestimate the return period associated to the 
observation x ti , which could be a problem in the context of alarms (it is 



better to have more alarms than less), except if the plots It, it) and [t,b t 



were close enough, but it is generally not the case for our datasets where t t 
appeared approximately as a constant function of time (see [2]). 

Step 3: We use the fact that if (X,) are i.i.d. random variables, then 
we have for any time interval / (T) with length T 



This remark is important since we want to define for each new observation 
a warning time and not a predicted return period, which means to look 
backward in time. 

Hence for each new observation xt t , to which a return time Tj has been 
associated (via Step 2), we will look in the interval (tj — T; tj) to see if there 
exists an observation exceeding x ti ] if it does, we ring an alarm at this time 
tj, since as already said, on average, we do not expect a second exceedance 
on (tj — Tf,ti]. Notice that we chose here to consider an inequality in the 
indicator set, and not a strict one as in (jSJ), the level corresponding now to 
an observation. 

To finish this section, let us sumarize our method. 



• Using the plot ( t, b t ) , associate a time Tj to each observation x ti {i > 1); 



• for each new observation x tl , consider / (Tj) = (tj — Tj, tj]; 

• look for the existence of an observation Xt > b^, for t G (tj — Tj, tj); 
if there exists at least such an observation, i.e. 









{Xj> 6 T J 



> 1, then generate a warning time at tj. 



Ve/(Ti) 
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Now to illustrate our method, let us consider the example of the maximum 
number of Salmonella Virchow isolates. In Figure 3, the x-axis corresponds 
to the values of t from 2 to 100 weeks and the y-axis to b t \ the two dashed 
lines indicate the 95% confidence interval bounds of bt- 

The return level/return period graphs for the six serotypes presented in Sec- 
tion 2 are represented in Figure 4. 

4 Applications 

For each week from January 2000 to December 2008, the EVT method was 
applied to the six time series presented in Section 2. For any week t, counts 
of weeks from t — 3 to t + 3 of years from y — 5 to y — 1 were used. Moreover, 
in order to reduce the probability that an alarm could be triggered for few 
sporadic standard rule has been adopted to keep an alarm at week t 

if at least 5 cases were observed during the 4 last weeks preceding t. This rule 
has already been applied at the Communicable Disease Surveillance Center 
(CDSC) using the method developed by Farrington et al. (see [6]). 
Several statistical methods for the prospective change-point detection in time 
series of counts have been already applied to Salmonella infections by various 
authors (see [23] , [B] , [T3] ) . We chose to compare the alarms generated by the 
EVT method with the ones generated by the Farrington method for the six 
serotypes and for each week since 2005, as the National Reference Center for 
Salmonella applies the Farrington method. 

Comparisons between the two methods are shown in Table 1. The concor- 
dance is equal to 93.7% (Typhimurium), 95.9% (Derby), 96.8% (Agona), 
97.4% (Virchow), 98.7% (Enteritidis) and 99.1% (Manhattan). When com- 
paring the results outside of the diagonal with the real past epidemies, it 
appears that our method seems to fit better the reality than Farrington's 
one, providing less alarms when historically there was indeed no epidemy, 
and more alarms when there was. It makes our method quite promising. 
Nevertheless this comparison cannot replace a more standard procedure that 
systematically compare for each week the statistical alarms and the alerts 
identified by an epidemiologist. In this case, the epidemiologist judgement is 
considered as the gold standard (even if the epidemiologist is not infallible). 
Figures 5 and 6 represent both the alarms and the weekly counts over time for 
the serotypes Manhattan and Agona. Each triangle represents a statistical 
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Manhattan 



Derby 



EVT 

+ Total 
440 3 443 
Farrington + 1 19 20 

Total 441 22 463 



EVT 

+ Total 
441 12 453 
Farrington + 7 3 10 

Total 448 15 463 



Agona 

EVT 
+ Total 
427 13 440 
Farrington + 2 21 23 
Total 429 34 463 



Virchow 

EVT 
+ Total 
451 1 452 
Farrington + 11 11 
Total 462 1 463 



Typhimurium 

EVT 
+ Total 
418 1 419 
Farrington + 28 16 44 
Total 446 17 463 



Enteritidis 

EVT 

+ TbtaT 
455 455 
Farrington + 6 2 8 

Total 461 2 463 



Table 1: Two-way tables of frequency counts of non-alarms (-) and alarms 
(+) from the Farrington method and the EVT method, for each serotype. 



alarm. Triangles on the first line represent the alarms generated by the EVT 
method, whereas the ones generated by the Farrington method are given on 
the second line. 

In Figure 5, the alarms generated by the two methods occurred in the same 
period that corresponds to a documented outbreak, delimited by the dashed 
lines, for the serotype Manhattan (see [IS]). From August 2005 to Febru- 
ary 2006, a community- wide outbreak of Salmonella Manhattan infections 
occurred in France. The investigation incriminated pork products from a 
slaughterhouse as being the most likely source of this outbreak. There was 
a concordance between the temporal (October-December 2005) and the geo- 
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graphical (south-eastern France) occurrence of the majority of cases and the 
distribution of products from the slaughterhouse. 

In Figure 6, alarms for the serotype Agona are distributed from 2000 to 2008. 
There is a concordance between the two methods during three periods. The 
first period, corresponding to 5 weeks in August and September 2003, was not 
documented as an outbreak. The second concordance period, corresponds to 
15 consecutive weeks from the last week in December 2004 to week 15 in 
April 2005. This second period is more interesting as it corresponds to the 
beginning of a large outbreak of infections in infants linked to the consump- 
tion of powdered infant formula (see [3]). The outbreak period, delimited by 
the two dashed lines, took place in two stages: the first stage from week 53 
in December 2004 to week 10 in March 2005 and the second from week 11 to 
week 21. A total of 47 cases less than 12 months age were identified during 
the first stage and 94 cases less than 12 months age were identified during 
the second stage. The third period corresponds to the week 29 in July 2008. 
It included five cases, two of them coming from a foodborne disease outbreak 
involving piglet consumption, and the three others being probably sporadic 
cases. 

The EVT was implemented using R version 2.9 [T?] : see [2]. The function 
called algo.farrington, implemented in the R-Package 'surveillance' ([12]) was 
used to apply the Farrington method. 

5 Discussion 

We believe that the EVT method meets a number of requirements, listed 
by Farrington et al. (see [7]), for the outbreak detection algorithms imple- 
mented in surveillance systems. Indeed, this method is able to monitor a 
large number of time series which became an absolute necessity in modern 
computerized surveillance systems. It can deal with a wide range of events 
as it is the case for the Salmonella infections with the routinely analyses of 
several hundred serotypes per week. It can handle times series with great 
numbers of cases (such as Salmonella Enteritidis) or small numbers of cases 
(such as Salmonella Manhattan). Seasonality is taken into account by com- 
paring counts over the same periods of time. Other methods propose a direct 
way to treat the past aberrations, for instance by associating low weights to 
the weeks coinciding with past outbreaks. There is no such a need when 
using the EVT method since the return period is not a constant but depends 
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on each observed count; alarms can then be generated even if past outbreaks 
exist. It is particularly the case with low counts for which the return period 
is small and does not include the past outbreaks. Finally, the method is 
implemented in a function using the R language, allowing to run it in an 
automated procedure with minimal user intervention. 

Although the model developed by Farrington et al. (see [B]) became a stan- 
dard reference method, routinely used in France since many years and in- 
corporated in several surveillance systems: human Salmonella, non human 
Salmonella (see [1]), legionella (see [10]) or in syndromic surveillance systems 
(see [8]), the EVT method seems also to be a valuable and interesting tool 
for the recognition of time clusters. It could be integrated in the family of 
outbreak detection algorithms used by the public health surveillance agencies 
since developing effective computer-assisted outbreak detection systems still 
remains a necessity to ensure timely public health intervention. 
Another possible way to proceed would be to transform the sample of discrete 
random variables into a continuous one in order to use standard EVT tools, 
instead of quantile's bounds. It has been empirically studied when smooth- 
ing the data via a kernel transformation and provided promising results (see 
[2]); it will be the subject of a future work. 

Finally, we also plan to investigate the extension of such EVT methods for 
time or spatially dependent data. 
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Figure 1: Weekly counts of isolates reported to the National Reference Cen- 
ter for Salmonella in France, January 1, 1995 to December 31, 2008: (a) 
Salmonella Manhattan; (b) Salmonella Derby; (c) Salmonella Agona; (d) 
Salmonella Virchow; (e) Salmonella Typhimurium; (f) Salmonella Enteri- 
tidis. 
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Figure 2: Weekly counts of isolates for Salmonella Typhimurium. The cur- 
rent week, represented by the (blue) arrow, is the last week of December 
2008. The counts that occurred in comparable periods (± 3 weeks) in the 
five previous years, and used to generate or not an alarm, are represented by 
the (red) striped bands. 
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Figure 3: The return level/return period graph for Salmonella Virchow. The 
black curve represents the upper bound of the return level. Dashed curves 
represent the 95% confidence interval of this upper bound. To the observation 
y = 20 (respectively y = 15) does correspond on the x-axis b 9 i (respectively 
6 7 ) from which we deduce the return period equals to T = 91 (respectively 
T = 7) weeks. 
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Figure 4: The return level/return period graphs calculated for the last week in 
2008: (a) Salmonella Manhattan; (b) Salmonella Derby; (c) Salmonella Ag- 
ona; (d) Salmonella Virchow; (e) Salmonella Typhimurium; (f) Salmonella 
Enteritidis. 
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Figure 5: Salmonella Manhattan: Weekly counts from January 1, 2000 to 
December 31, 2008. Roman numerals refer to the quarters of the years. 
Alarms generated by the EVT method are represented by triangles on the 
first line. Alarms generated by the Farrington method are represented by 
triangles on the second line. The documented outbreak is delimited by the 
two dashed lines. 
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Figure 6: Salmonella Agona: Weekly counts from January 1, 2000 to De- 
cember 31, 2008. Roman numerals refer to the quarters of the years. Alarms 
generated by the EVT method are represented by triangles on the first line. 
Alarms generated by the Farrington method are represented by triangles on 
the second line. The documented outbreak is delimited by the two dashed 
lines. 
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